fix: arithmetic overflow in mDNS TTL implementation #5967

Open
momoshell wants to merge 3 commits into master

Conversation

momoshell

Description

This PR fixes possible arithmetic overflows in the mDNS implementation when processing response packets with very large TTL values.

Fixes #5943

Notes & open questions

I’ve kept the source in its original form and added a new filter combinator. This is a minimal change and, in my opinion, the cleanest solution.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@momoshell changed the title from "Fixed arithmetic overflow and added unit test" to "fix: arithmetic overflow in mDNS TTL implementation" on Mar 31, 2025
@elenaf9 (Member) left a comment

Two nits, rest LGTM. Thanks!

Comment on lines 181 to 182
.filter(move |peer| peer.id() != &local_peer_id)
.filter(move |peer| now.checked_add(peer.ttl()).is_some())

Suggested change
-.filter(move |peer| peer.id() != &local_peer_id)
-.filter(move |peer| now.checked_add(peer.ttl()).is_some())
+.filter(move |peer| peer.id() != &local_peer_id && peer.ttl() > now)

No need for multiple Filter wrappers.

@elenaf9 (Member) commented Apr 5, 2025

Sorry, just saw that my above suggestion doesn't make any sense. I meant:

Suggested change
-.filter(move |peer| peer.id() != &local_peer_id)
-.filter(move |peer| now.checked_add(peer.ttl()).is_some())
+.filter(move |peer| peer.id() != &local_peer_id && now.checked_add(peer.ttl()).is_some())

Still, I am not a big fan of calculating the now + peer.ttl() sum twice. Why not use filter_map and return the now.checked_add(peer.ttl()) Option so it can be used directly in the next step?
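
For example, something like this (untested sketch; `extract_discovered` as in this PR's diff):

pub(crate) fn extract_discovered(
    &self,
    now: Instant,
    local_peer_id: PeerId,
) -> impl Iterator<Item = (PeerId, Multiaddr, Instant)> + '_ {
    self.discovered_peers()
        .filter(move |peer| peer.id() != &local_peer_id)
        // Compute `now + ttl` once; `checked_add` returns `None` on
        // overflow, so peers with overflowing TTLs are dropped here.
        .filter_map(move |peer| Some((peer, now.checked_add(peer.ttl())?)))
        .flat_map(move |(peer, new_expiration)| {
            let observed = self.observed_address();
            peer.addresses().iter().filter_map(move |address| {
                let new_addr = _address_translation(address, &observed)?;
                let new_addr = new_addr.with_p2p(*peer.id()).ok()?;
                Some((*peer.id(), new_addr, new_expiration))
            })
        })
}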

@momoshell (Author) commented Apr 4, 2025

Hey hey, thank you for your responses.
After reviewing the proposed changes, I realized something. It might be better not to filter out peers with large TTLs. A more effective solution could be to cap the new_expiration value to a reasonable limit, such as one day.

// Use a reasonable maximum - 1 day should be sufficient for most scenarios
const MAX_TTL: Duration = Duration::from_secs(60 * 60 * 24);

pub(crate) fn extract_discovered(
    &self,
    now: Instant,
    local_peer_id: PeerId,
) -> impl Iterator<Item = (PeerId, Multiaddr, Instant)> + '_ {
    self.discovered_peers()
        .filter(move |peer| peer.id() != &local_peer_id)
        // Remove this filter to keep all peers
        // .filter(move |peer| now.checked_add(peer.ttl()).is_some())
        .flat_map(move |peer| {
            let observed = self.observed_address();
            // Cap expiration time to avoid overflow
            let new_expiration = match now.checked_add(peer.ttl()) {
                Some(expiration) => expiration,
                // Fall back to `now` if even now + MAX_TTL overflows (extremely unlikely)
                None => now.checked_add(MAX_TTL).unwrap_or(now),
            };

            peer.addresses().iter().filter_map(move |address| {
                let new_addr = _address_translation(address, &observed)?;
                let new_addr = new_addr.with_p2p(*peer.id()).ok()?;
                Some((*peer.id(), new_addr, new_expiration))
            })
        })
}

@momoshell (Author)

On the other hand, we're still not solving the essence of this problem.
And it seems to me that there might be issues when decoding in MdnsPacket::new_from_bytes.
But I guess this would require a much larger effort on the whole subject.

@elenaf9 (Member) commented Apr 5, 2025

> After reviewing the proposed changes, I realized something. It might be better not to filter out peers with large TTLs.

Can you expand on that? Why would a non-malicious peer set a TTL that is large enough to cause an overflow?

> And it seems to me that there might be issues when decoding in MdnsPacket::new_from_bytes.

What issue?

@momoshell (Author) commented Apr 6, 2025

> Can you expand on that? Why would a non-malicious peer set a TTL that is large enough to cause an overflow?

My initial reaction is that the mDNS protocol itself shouldn't be responsible for handling security. If we're implementing changes that address security concerns, we should consider handling them at a higher layer, outside the mDNS protocol itself.

Failing the entire response would be overly harsh - we'd lose all the valid data in the response because of one overly enthusiastic TTL.
And if we filter out all nodes with problematic TTLs, it could potentially make debugging more difficult in the future. That’s why I believe capping the TTL is a more reasonable short-term fix, especially while we're still investigating the root cause of this issue.

> What issue?

I think that we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of MdnsResponse to ensure that we're capturing the right information.

Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

@elenaf9 (Member) commented Apr 21, 2025

> I think that we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of MdnsResponse to ensure that we're capturing the right information.
>
> Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

If the packet is malformed we should discard the whole packet anyway. I don't think there is a way to recover from it. How are we supposed to know what other fields are correct?

@momoshell (Author) commented Apr 22, 2025

This is turning into more of a design decision than we expected.
We’re seeing two possible paths here:

  • Discard everything due to an overly aggressive TTL.
  • Cap the TTL, retain the data, and rely on the existing protocol implementations to handle the security aspects.

We’re still not entirely sure why TTL values are coming in this way, and it feels a bit “nuclear” to filter everything out, because it seems that there is some hidden problem somewhere.
I feel that if we filter everything out, debugging this issue becomes very hard, if not impossible.

@elenaf9 (Member) commented May 5, 2025

> I think that we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of MdnsResponse to ensure that we're capturing the right information.
>
> Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

We are using hickory_proto::Message::from_vec to decode received packets, so this would be something to discuss at the upstream hickory-dns repo.
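
For reference, the decode step is roughly the following (minimal sketch; the `decode` wrapper and `buf` are illustrative, only `Message::from_vec` is the actual call we use):

use hickory_proto::op::Message;

// `buf` is the raw mDNS datagram read from the socket. Decoding happens
// in one step, so a malformed packet fails here as a whole rather than
// field by field.
fn decode(buf: &[u8]) -> Result<Message, hickory_proto::error::ProtoError> {
    let message = Message::from_vec(buf)?;
    for record in message.answers() {
        // `ttl()` is a u32 straight from the wire; this is the value
        // that later gets added to `Instant::now()`.
        let _ttl_secs = record.ttl();
    }
    Ok(message)
}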

We’re still not entirely sure why TTL values are coming in this way, and it feels a bit “nuclear” to filter everything out, because it seems that there is some hidden problem somewhere.
I feel that if we filter out everything, debugging this issue becomes a pretty hard thing to do, or somewhat impossible.

We could add a debug message when discarding a packet and print the discarded packet; wouldn't that solve this problem?
I don't think just capping the TTL makes debugging any easier, because then you wouldn't even discover that there was an invalid TTL (and thus potentially malformed packet) in the first place.
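
Roughly like this (untested sketch, keeping the checked_add filter from this PR and assuming we log via tracing):

.filter(move |peer| {
    let valid = now.checked_add(peer.ttl()).is_some();
    if !valid {
        // Surface the invalid TTL instead of silently dropping the record.
        tracing::debug!(peer = %peer.id(), ttl = ?peer.ttl(), "discarding peer record with overflowing TTL");
    }
    valid
})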

@momoshell (Author)

> We are using hickory_proto::Message::from_vec to decode received packets, so this would be something to discuss at the upstream hickory-dns repo.

Exactly, and this crate is still in its alpha stage. That's why I was thinking about the ease of debugging.

> We could add a debug message when discarding a packet and print the discarded packet; wouldn't that solve this problem?
> I don't think just capping the TTL makes debugging any easier, because then you wouldn't even discover that there was an invalid TTL (and thus potentially malformed packet) in the first place.

Your call on this one — I’m happy to jump in and help whichever way you go!
If I had to pick, I’d lean toward capping it and tossing in a debug message when it hits that point.
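
Something like this, reusing MAX_TTL from my snippet above (untested, and assuming tracing for the log):

let new_expiration = now.checked_add(peer.ttl()).unwrap_or_else(|| {
    // Cap at MAX_TTL, but leave a trace that the TTL was invalid.
    tracing::debug!(peer = %peer.id(), ttl = ?peer.ttl(), "TTL overflows Instant; capping at MAX_TTL");
    now.checked_add(MAX_TTL).unwrap_or(now)
});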

That said, your proposed solution works for me too — I just don’t have the same inside scoop on how you all usually like to handle these things.

Successfully merging this pull request may close these issues:

Bug: Arithmetic overflow in mDNS implementation when adding TTL to Instant